Abstract

The problem of supervised classification (or discrimination) with functional data is
considered, with a special interest in the popular k-nearest neighbors (k-NN) classifier.

First, relying on a recent result by Cerou and Guyader (2006), we prove the consistency
of the k-NN classifier for functional data whose distribution belongs to a broad
family of Gaussian processes with triangular covariance functions.

Second, on a more practical side, we check the behavior of the k-NN method when
compared with a few other functional classifiers. This is carried out through a small
simulation study and the analysis of several real functional data sets. While no global
"uniform" winner emerges from such comparisons, the overall performance of the k-NN
method, together with its sound intuitive motivation and relative simplicity, suggests
that it could represent a reasonable benchmark for the classification problem with
functional data.
1. Introduction



1.1 Some background on supervised classification 

Supervised classification is the modern name for one of the oldest statistical problems
in experimental science: to decide whether an individual, from which just a random
measurement $X$ (with values in a "feature space" $\mathcal{F}$ endowed with a metric $D$) is
known, belongs to the population $P_0$ or to $P_1$. For example, in a medical problem $P_0$ and
$P_1$ could correspond to the groups of "healthy" and "ill" individuals, respectively. The
decision must be taken from the information provided by a "training sample"
$\mathcal{X}_n = \{(X_i, Y_i), 1 \le i \le n\}$, where $X_i$, $i = 1, \ldots, n$, are independent
replications of $X$, measured on $n$ randomly chosen individuals, and $Y_i$ are the corresponding
values of an indicator variable which takes values 0 or 1 according to the membership of the
$i$-th individual to $P_0$ or $P_1$. Thus the mathematical problem is to find a "classifier"
$g_n(x) = g_n(x; \mathcal{X}_n)$, with $g_n : \mathcal{F} \to \{0, 1\}$, that minimizes the
classification error $P\{g_n(X) \ne Y\}$.

The term "supervised" refers to the fact that the individuals in the training sample are
supposed to be correctly classified, typically using "external" non-statistical procedures, so
that they provide a reliable basis for the assignment of the new observation. This problem,
also known as "statistical discrimination" or "pattern recognition", is at least 70 years old.
The origin goes back to the classical work by Fisher (1936) where, in the $d$-variate case
$\mathcal{F} = \mathbb{R}^d$, a simple "linear classifier" $g_n(x) = \mathbb{1}_{\{x : w'x + w_0 > 0\}}$
was introduced ($\mathbb{1}_A$ stands for the indicator function of a set $A \subset \mathcal{F}$).

A deep, insightful perspective on the supervised classification problem can be found in
the book of Devroye et al. (1996). Other useful textbooks are Hand (1997) and Hastie et al.
(2001). All of them focus on the standard multivariate case $\mathcal{F} = \mathbb{R}^d$.

It is not difficult to prove (e.g., Devroye et al., 1996, p. 11) that the optimal classification
rule (often called "Bayes rule") is

$$g^*(x) = \mathbb{1}_{\{\eta(x) > 1/2\}}, \qquad (1)$$

where $\eta(x) = E(Y|X = x)$. Of course, since $\eta$ is unknown the exact expression of this rule is
usually unknown, and thus different procedures have been proposed in order to approximate
it. In particular, it can be seen that Fisher's linear rule is optimal provided that the
conditional distributions of $X|Y = 0$ and $X|Y = 1$ are both normal with identical covariance
matrix. While these conditions look quite restrictive, and it is straightforward to construct
problems where any linear rule performs poorly, Fisher's classifier is still by far the
most popular choice among users.

A simple non-parametric alternative is given by the k-nearest neighbors (k-NN) method,
which is obtained by replacing the unknown regression function $\eta(x)$ in (1) with the regression
estimator

$$\eta_n(x) = \frac{1}{k} \sum_{i=1}^{n} \mathbb{1}_{\{X_i \in k(x)\}} Y_i, \qquad (2)$$

where $k = k_n$ is a given (integer) smoothing parameter and "$X_i \in k(x)$" means that $X_i$
is one of the $k$ nearest neighbours of $x$. More concretely, if the pairs $(X_i, Y_i)_{1 \le i \le n}$
are re-indexed as $(X_{(i)}, Y_{(i)})_{1 \le i \le n}$ so that the $X_{(i)}$'s are arranged in increasing
distance from $x$, $D(x, X_{(1)}) \le D(x, X_{(2)}) \le \ldots \le D(x, X_{(n)})$, then
$k(x) = \{X_{(i)}, 1 \le i \le k\}$. This leads to the k-NN classifier
$g_n(x) = \mathbb{1}_{\{\eta_n(x) > 1/2\}}$.
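
As an illustration only, the k-NN regression estimator and the resulting classifier can be sketched in a few lines of Python for curves observed on a common grid. The function names (`sup_distance`, `knn_regression`, `knn_classify`) are hypothetical, the supremum distance is just one possible choice of the metric $D$, and ties in distance are broken by sample order, which is an assumption not discussed in the text.

```python
def sup_distance(x, z):
    """Supremum-norm distance between two curves sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(x, z))


def knn_regression(x, sample, labels, k, dist=sup_distance):
    """k-NN regression estimate: average label of the k curves closest to x."""
    # Indices of the training curves sorted by increasing distance to x
    # (Python's sort is stable, so ties keep the original sample order).
    order = sorted(range(len(sample)), key=lambda i: dist(x, sample[i]))
    return sum(labels[i] for i in order[:k]) / k


def knn_classify(x, sample, labels, k, dist=sup_distance):
    """k-NN classifier: 1 if the k-NN regression estimate exceeds 1/2, else 0."""
    return 1 if knn_regression(x, sample, labels, k, dist) > 0.5 else 0
```

Any other metric on the discretized curves can be passed through the `dist` argument, so the same sketch covers both norms used later in the numerical comparisons.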

It is well known that, in addition to this simple classifier, several other alternative
methods (kernel classifiers, neural networks, support vector machines, ...) have been developed
and extensively analyzed in recent years. However, when used in practice with real data
sets, the performance of Fisher's rule is often found to be very close to that of the best one
among all the main alternative procedures. On these grounds, Hand (2006) has argued in
a provocative paper about the "illusion of progress" in supervised classification techniques.
The central idea would be that the study of new classification rules often fails to take into
account the structure of real data sets and tends to overlook the fact that, in spite of
its theoretical limitations, Fisher's rule is quite satisfactory in many practical applications.
This, together with its conceptual simplicity, explains its popularity over the years.

1.2 The purpose and structure of this paper 

We are concerned here with the problem of (binary) supervised classification with
functional data. That is, we consider the general framework indicated above but we will assume
throughout that the space $(\mathcal{F}, D)$ where the random elements $X_i$ take values is a separable
metric space of functions. For some theoretical results (Theorem 2) we will impose a more
specific assumption by taking $\mathcal{F}$ as the space $C[a, b]$ of real continuous functions defined on
a closed finite interval $[a, b]$, with the usual supremum norm $\|\cdot\|_\infty$.

The study of discrimination techniques with functional data is not as developed as the 
corresponding finite-dimensional theory but, clearly, is one of the most active research topics 
in the booming field of functional data analysis (FDA). Two well-known books including 
broad overviews of FDA with interesting examples are Ferraty and Vieu (2006) and Ramsay 
and Silverman (2005). Other recent more specific references will be mentioned below. 

There are of course several important differences between the theory and practice of
supervised classification for functional data and the classical development of this topic in
the finite-dimensional case, where typically the data dimension $d$ is much smaller than the
sample size $n$ (the "high-dimensional" case where $d$ is "large", and usually $d > n$, requires
a separate treatment). A first important practical difference is the role of Fisher's linear
discriminant method as a "default" choice and a benchmark for comparisons. As we have
mentioned, this holds for the finite-dimensional cases with "small" values of $d$, but it is no
longer true if functional (or high-dimensional) data are involved. To begin with, there is no
obvious way to apply Fisher's idea in practice in the infinite-dimensional case, as it requires
inverting a linear operator, which is not in general a straightforward task in functional spaces;
see, however, James and Hastie (2001) for an interesting adaptation of linear discrimination
ideas to a functional setting. Then, the question is whether there exists any functional
discriminant method, based on simple ideas, which could play a reference role similar to
that of Fisher's method in the finite-dimensional case. The results in this paper suggest (as
a partial, not definitive, answer) that the k-NN method could represent a "default standard"
in functional settings.

Another difference, particularly important from the theoretical point of view, concerns
the universal consistency of the k-NN classifier. A classical result by Stone (1977) establishes
that in the finite-dimensional case (with $X_i \in \mathbb{R}^d$) the conditional error of the k-NN
classifier

$$L_n = P\{g_n(X) \ne Y \,|\, \mathcal{X}_n\}, \qquad (3)$$

converges in probability (and also in mean) to that of the Bayes (optimal) rule $g^*$, that
is, $E(L_n) \to L^* = P\{g^*(X) \ne Y\}$, provided that $k_n \to \infty$ and $k_n/n \to 0$ as $n \to \infty$.
This result holds universally, that is, irrespective of the distribution of the variable $(X, Y)$.
The interesting point here is that this universal consistency result is no longer valid in the
infinite-dimensional setting. As recently proved by Cerou and Guyader (2006), if the space
$\mathcal{F}$ where $X$ takes values is a general separable metric space, a non-trivial condition must
be imposed on the distribution of $(X, Y)$ in order to ensure the consistency of the k-NN
classifier.

The aim of this paper is twofold, with a common focus on the k-NN classifier and in close
relation with the two above-mentioned differences between the classification problem in finite
and infinite-dimensional settings. First, on the theoretical side, we take a further look at the
consistency theorem in Cerou and Guyader (2006) by giving concrete non-trivial examples where
their consistency condition is fulfilled. Second, from a more practical viewpoint, we will carry
out numerical comparisons (based both on Monte Carlo studies and real data examples) to
assess the performance of different functional classifiers, including k-NN.

This paper is organized as follows. In Section 2 the consistency of the functional k-NN
classifier is established, as a consequence of Theorem 2 in Cerou and Guyader (2006), for a
broad class of Gaussian processes. In Section 3 other functional classifiers recently considered
in the literature are introduced and briefly discussed. They are all compared through a
simulation study (based on two different models) as well as six real data examples, very
much in the spirit of Hand's (2006) paper, where the performance of the classical Fisher's
rule was assessed in terms of its discrimination capacity in several randomly chosen data
sets.

2. On the consistency of the functional k-NN classifier

In the functional classification problem several auxiliary devices have been used to
overcome the extra difficulty posed by the infinite-dimensional nature of the feature space. They
include dimension reduction techniques (e.g., James and Hastie 2001, Preda et al. 2007),
random projections combined with data-depth measures (Cuevas et al. 2007) and different
adaptations to the functional framework of several non-parametric and regression-based
methods, including kernel classifiers (Abraham et al. 2006, Biau et al. 2005, Ferraty and
Vieu 2003), reproducing kernel procedures (Preda 2007), logistic regression (Müller and
Stadtmüller 2005) and multilayer perceptron techniques with functional inputs (Ferre and
Villa 2006).

2.1 On the consistency of the functional k-NN classifier 

The functional k-NN classifier also belongs to the class of procedures adapted from the
usual non-parametric multivariate setup. Nevertheless, unlike most of the above-mentioned
functional methodologies, the k-NN procedure works according to exactly the same principles
in the finite and infinite-dimensional cases. It is defined by
$g_n(x) = \mathbb{1}_{\{\eta_n(x) > 1/2\}}$, where $\eta_n$ is the k-NN regression estimator (2), whose
definition is formally identical to that of the finite-dimensional case. The intuitive
interpretation is also the same in both cases. No previous data manipulation, projection or
dimension reduction technique is required in principle, apart from the discretization process
necessarily involved in the practical handling of functional data. In the present section we
offer some concrete examples where the k-NN functional classifier is weakly consistent. As we
have mentioned in the previous section, this is a non-trivial point since the k-NN classifier is
no longer universally consistent in the case of infinite-dimensional inputs $X$.

Throughout this section the feature space where the variable $X$ takes values is a separable
metric space $(\mathcal{F}, D)$. We will denote by $P_X$ the distribution of $X$ defined by
$P_X(B) = P\{X \in B\}$ for $B \in \mathcal{B}_\mathcal{F}$, where $\mathcal{B}_\mathcal{F}$ are the Borel sets of $\mathcal{F}$.

Let us now consider the following regularity assumption on the regression function
$\eta(x) = E(Y|X = x)$:

(BC) Besicovitch condition:

$$\lim_{\delta \to 0} \frac{1}{P_X(B_{X,\delta})} \int_{B_{X,\delta}} \eta(z)\, dP_X(z) = \eta(X) \quad \text{in probability},$$

where $B_{x,\delta} := \{z \in \mathcal{F} : D(x, z) \le \delta\}$ is the closed ball with center $x$ and radius $\delta$.
Under (BC) Cerou and Guyader (2006, Th. 2) get the following consistency result.

Denote by $L_n$ and $L^*$, respectively, the conditional error associated with the above defined
k-NN classifier and the Bayes (optimal) error for the problem at hand. If $(\mathcal{F}, D)$ is separable
and condition (BC) is fulfilled then the k-NN classifier is weakly consistent, that is
$E(L_n) \to L^*$, as $n \to \infty$, provided that $k \to \infty$ and $k/n \to 0$.

The Besicovitch condition also plays an important role in the consistency of kernel rules (see
Abraham et al. 2006).

Cerou and Guyader (2006) have also considered the following more convenient condition
(called $P_X$-continuity) that ensures (BC): For every $\epsilon > 0$ and for $P_X$-a.e. $x \in \mathcal{F}$,

$$\lim_{\delta \to 0} P_X\{z \in \mathcal{F} : |\eta(z) - \eta(x)| > \epsilon \,|\, D(x, z) < \delta\} = 0.$$
However, for our purposes, it will be sufficient to observe that the continuity ($P_X$-a.e.) of
$\eta(x)$ also implies (BC). We are interested in finding families of distributions of $(X, Y)$ under
which the regression function $\eta(x)$ is continuous ($P_X$-a.e.) and hence (BC) holds.

From now on we will use the following notation. Let $\mu_i$ be the distribution of $X$
conditional on $Y = i$, that is, $\mu_i(B) = P\{X \in B \,|\, Y = i\}$, for $B \in \mathcal{B}_\mathcal{F}$ and $i = 0, 1$.
We denote by $S_i \subset \mathcal{F}$ the support of $\mu_i$, for $i = 0, 1$, and $S = S_0 \cap S_1$. The expression
$\mu_0 \ll \mu_1$ will denote that $\mu_0$ is absolutely continuous with respect to $\mu_1$. Also we will
assume that $p = P\{Y = 0\}$ fulfills $p \in (0, 1)$.

The following theorem shows that the property of continuity (resp. $P_X$-continuity) of
$\eta(x)$, and hence the weak consistency of the k-NN classifier, follows from the continuity
(resp. $P_X$-continuity) of the Radon-Nikodym derivative of $\mu_0$ with respect to $\mu_1$, provided
that it exists.



Theorem 1: Assume that $P_X(\partial S) = 0$ and that $\mu_0 \ll \mu_1$ and $\mu_1 \ll \mu_0$ on $S$. Then the
following inequality holds for $P_X$-a.e. $x, z \in \mathcal{F}$:

$$|\eta(z) - \eta(x)| \le \frac{p}{1-p} \left| \frac{d\mu_0}{d\mu_1}(z) - \frac{d\mu_0}{d\mu_1}(x) \right|,$$

where $d\mu_0/d\mu_1$ denotes the Radon-Nikodym derivative of $\mu_0$ with respect to $\mu_1$. When
$S_0 = S_1 = S$ the assumption $P_X(\partial S) = 0$ may be dropped.

In particular, $\eta$ is continuous $P_X$-a.e. (resp. $P_X$-continuous) whenever $d\mu_0/d\mu_1$ is
continuous $P_X$-a.e. (resp. $P_X$-continuous). Of course, a similar result holds by interchanging
the sub-indices 0 and 1 and replacing $p$ by $1 - p$.

Proof: Define $\mu = \mu_0 + \mu_1$. Then $\mu_i \ll \mu$, for $i = 0, 1$, and we can define the
Radon-Nikodym derivatives $f_i = d\mu_i/d\mu$, for $i = 0, 1$. From the definition of the conditional
expectation we know that $\eta(x) = E(Y|X = x) = P(Y = 1|X = x)$ can be expressed by

$$\eta(x) = \frac{f_1(x)(1-p)}{f_0(x)\,p + f_1(x)(1-p)}. \qquad (4)$$

Observe that $\mu|_{S_{1-i}^c \cap S_i} = \mu_i|_{S_{1-i}^c \cap S_i}$ and thus
$f_i|_{S_{1-i}^c \cap S_i} = \mathbb{1}_{S_{1-i}^c \cap S_i}$, for $i = 0, 1$. Since $\mu_0 \ll \mu_1$ and
$\mu_1 \ll \mu_0$ on $S$ then, on this set, we can define the Radon-Nikodym derivatives $d\mu_0/d\mu_1$ and
$d\mu_1/d\mu_0$. In this case, it also holds that $\mu|_S \ll \mu_i|_S$, for both $i = 0, 1$, and

$$\frac{d\mu}{d\mu_1}(x) = 1 + \frac{d\mu_0}{d\mu_1}(x) \quad \text{for any } x \in S.$$

Then (see, e.g., Folland 1999), for $i = 0, 1$ and for $P_X$-a.e. $x \in S$,

$$f_i(x) = \frac{d\mu_i}{d\mu}(x) = \frac{d\mu_i}{d\mu_1}(x) \left( \frac{d\mu}{d\mu_1}(x) \right)^{-1} = \frac{d\mu_i}{d\mu_1}(x)\, \frac{1}{1 + \frac{d\mu_0}{d\mu_1}(x)}. \qquad (5)$$
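Substituting (5) into expression (4) we get

$$\eta(x) = \begin{cases} 0 & \text{if } x \in S_0 \cap S_1^c, \\ 1 & \text{if } x \in S_1 \cap S_0^c, \\ \dfrac{1-p}{p\, \frac{d\mu_0}{d\mu_1}(x) + 1 - p} & \text{if } x \in S. \end{cases} \qquad (6)$$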



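Using this last expression we can see that if $P_X(\partial S) = 0$ and if $d\mu_0/d\mu_1$ is continuous
$P_X$-a.e. (resp. $P_X$-continuous) on $S$ then $\eta$ is also continuous $P_X$-a.e. (resp.
$P_X$-continuous) on $S$. To see this it suffices to observe that, for $P_X$-a.e. $x, z \in \text{int}(S)$,

$$|\eta(z) - \eta(x)| = \left| \frac{1-p}{p\, \frac{d\mu_0}{d\mu_1}(z) + 1 - p} - \frac{1-p}{p\, \frac{d\mu_0}{d\mu_1}(x) + 1 - p} \right| \le \frac{p}{1-p} \left| \frac{d\mu_0}{d\mu_1}(z) - \frac{d\mu_0}{d\mu_1}(x) \right|.$$

To derive the last inequality we have used that, as $\mu_i$, $i = 0, 1$, are positive measures, the
Radon-Nikodym derivative $d\mu_0/d\mu_1$ is also non-negative, so that both denominators are bounded
below by $1 - p$.

□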



In order to be able to combine Theorem 1 and the consistency result in Cerou and
Guyader (2006, Th. 2), we are interested in finding distributions $\mu_0, \mu_1$ of an
infinite-dimensional random element $X$ such that $\mu_0 \ll \mu_1$ and $\mu_1 \ll \mu_0$ with continuous
Radon-Nikodym derivatives. Measures $\mu_0$ and $\mu_1$ satisfying that $\mu_0 \ll \mu_1$ and $\mu_1 \ll \mu_0$ on $S$
are said to be equivalent on $S$.

Let us denote by $(C[a,b], \|\cdot\|_\infty)$ the metric space of continuous real-valued functions $x$
defined on the interval $[a,b]$, endowed with the supremum norm,
$\|x\|_\infty = \sup\{|x(t)| : t \in [a, b]\}$. Also let $C^2[a, b]$ be the space of twice continuously
differentiable functions defined on $[a,b]$.

In the next theorem we show a broad class of Gaussian processes fulfilling the conditions
of Theorem 2 in Cerou and Guyader (2006). Thus the consistency of the k-NN classifier
is guaranteed for them. Key elements in the proof are the results by Varberg (1961)
and Jørsboe (1968) providing explicit expressions for the Radon-Nikodym derivative of a
Gaussian measure with respect to another one. From the Gaussianity assumption, the model
is completely determined by giving the mean and covariance functions. For the sake of
a clearer and more systematic presentation the statement is divided into three parts. The
first one applies to the case where the mean function in both functional populations, with
distributions $\mu_0$ and $\mu_1$ (corresponding to $X|Y = 0$ and $X|Y = 1$), is common and the
difference between both processes lies in the covariance functions (which however keep a
common structure). The second part considers the dual case where the difference lies in
the mean functions and the covariance structure is common. Finally, the third part of the
theorem generalizes the previous two statements by including the case of different mean and
covariance functions.
Theorem 2: Let $(\mathcal{F}, D) = (C[a,b], \|\cdot\|_\infty)$ with $0 \le a < b < \infty$.

a) Assume that $X|Y = i$, for $i = 0, 1$, are Gaussian processes on $[a, b]$, whose mean function
is zero and with covariance functions $\Gamma_i(s, t) = u_i(\min(s, t))\, v_i(\max(s, t))$, for
$s, t \in [a, b]$, where $u_i, v_i$, for $i = 0, 1$, are positive functions in $C^2[a, b]$. Assume also
that $v_i$, for $i = 0, 1$, and $v_1 u_1' - u_1 v_1'$ are bounded away from zero on $[a, b]$, that
$u_1 v_1' - u_1' v_1 = u_0 v_0' - u_0' v_0$ and that $u_1(a) = 0$ if and only if $u_0(a) = 0$. Then
$d\mu_0/d\mu_1$ is continuous on $\mathcal{F}$.

b) Assume that $X|Y = i$, for $i = 0, 1$, are Gaussian processes on $[a, b]$, with equal covariance
function $\Gamma(s, t) = u(\min(s, t))\, v(\max(s, t))$, for $s, t \in [a, b]$, where $u, v \in C^2[a,b]$ are
positive functions and $v$ and $v u' - u v'$ are bounded away from zero on $[a,b]$. Assume also
that the mean function of $X|Y = 1$ is 0 and that of $X|Y = 0$ is a function $m \in C^2[a,b]$,
such that $m(a) = 0$ whenever $u(a) = 0$. Then $d\mu_0/d\mu_1$ is continuous on $\mathcal{F}$.

c) Assume that $X|Y = i$, for $i = 0, 1$, are Gaussian processes on $[a, b]$, with mean functions
$m_i \in C^2[a,b]$ and covariance functions $\Gamma_i(s, t) = u_i(\min(s, t))\, v_i(\max(s, t))$, for
$s, t \in [a,b]$, where $u_i, v_i$, for $i = 0, 1$, are positive functions in $C^2[a,b]$ which fulfill the
same conditions imposed in (a). Assume also that $m_i(a) = 0$ whenever $u_i(a) = 0$. Then
$d\mu_0/d\mu_1$ is continuous on $\mathcal{F}$.

Therefore, under the assumptions in either (a), (b) or (c), the k-NN classifier discriminating
between $\mu_0$ and $\mu_1$ is weakly consistent when $k \to \infty$ and $k/n \to 0$.

Proof: 

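a) Varberg (1961, Th. 1) shows that, under the assumptions of (a), $\mu_0$ and $\mu_1$ are equivalent
measures and the Radon-Nikodym derivative of $\mu_0$ with respect to $\mu_1$ is given by

$$\frac{d\mu_0}{d\mu_1}(x) = C_1 \exp\left\{ \frac{1}{2} \left[ C_2\, x^2(a) + \int_a^b f(t)\, d\!\left( \frac{x^2(t)}{v_0(t) v_1(t)} \right) \right] \right\}, \qquad (7)$$

where

$$C_1 = \begin{cases} \left( \dfrac{v_0(a)\, v_1(b)}{v_0(b)\, v_1(a)} \right)^{1/2} & \text{if } u_0(a) = 0, \\[2ex] \left( \dfrac{u_1(a)\, v_1(b)}{v_0(b)\, u_0(a)} \right)^{1/2} & \text{if } u_0(a) \ne 0, \end{cases}
\qquad
C_2 = \begin{cases} 0 & \text{if } u_0(a) = 0, \\[1ex] \dfrac{v_0(a) u_0(a) - u_1(a) v_1(a)}{v_1(a) v_0(a) u_0(a) u_1(a)} & \text{if } u_0(a) \ne 0, \end{cases}$$

and

$$f(s) = \frac{v_1(s) v_0'(s) - v_0(s) v_1'(s)}{v_1(s) u_1'(s) - u_1(s) v_1'(s)}.$$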

Observe that, by the assumptions of the theorem, this function $f$ is differentiable with
bounded derivative. Thus $f$ is of bounded variation and it may be expressed as the
difference of two bounded positive increasing functions. Therefore the stochastic integral
in (7) is well defined and it can be evaluated integrating by parts,

$$\frac{d\mu_0}{d\mu_1}(x) = C_1 \exp\left\{ \frac{1}{2} \left[ C_3\, x^2(a) + C_4\, x^2(b) - \int_a^b \frac{x^2(t)}{v_0(t) v_1(t)}\, df(t) \right] \right\},$$

with $C_3 = C_2 - f(a)/(v_0(a) v_1(a))$ and $C_4 = f(b)/(v_0(b) v_1(b))$. It is clear that this
derivative is a continuous functional of $x$ with respect to the supremum norm.

Now, Theorem 1 implies that $\eta(x)$ is continuous and, therefore, the Besicovitch condition
(BC) holds and, from Theorem 2 in Cerou and Guyader (2006), the k-NN classifier is
weakly consistent. Note that the equivalence of $\mu_0$ and $\mu_1$ implies the coincidence of both
supports $S_0 = S_1 = S$.

b) In Jørsboe (1968), p. 61, it is proved that, under the indicated assumptions, $\mu_0$ and $\mu_1$
are equivalent measures with the following Radon-Nikodym derivative



where 



and 



_ m^{a) _ m{a) 

2u[a)v{a) u[a)v[a) 

$$g(t) = \frac{v(t) m'(t) - m(t) v'(t)}{v(t) u'(t) - u(t) v'(t)}.$$

Again, the integration by parts gives 

'X) = exp + - 2 x{a) + 2 4| .(6) - 2 44 dgH) } , (8) 



with 



Thus $d\mu_0/d\mu_1$, and hence $\eta$, are continuous and the consistency of the k-NN classifier
also holds in this case.

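c) Let us denote by $P_{m,\Gamma}$ the distribution of the Gaussian process with mean $m$ and
covariance function $\Gamma$. Then $\frac{d\mu_0}{d\mu_1}(x)$ is continuous since (see e.g. Folland 1991)

$$\frac{d\mu_0}{d\mu_1}(x) = \frac{dP_{m_0,\Gamma_0}}{dP_{m_1,\Gamma_1}}(x) = \frac{dP_{m_0,\Gamma_0}}{dP_{0,\Gamma_0}}(x)\, \frac{dP_{0,\Gamma_0}}{dP_{0,\Gamma_1}}(x)\, \frac{dP_{0,\Gamma_1}}{dP_{m_1,\Gamma_1}}(x), \qquad (9)$$

and, as we have shown in the proofs of (a) and (b), the Radon-Nikodym derivatives in
the right-hand side of (9) are all continuous. □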



Remark 1 (Application to the Ornstein-Uhlenbeck processes). Let $X|Y = i$,
for $i = 0, 1$, be Gaussian processes on $[a, b]$, with zero mean and covariance function
$\Gamma_i(s, t) = \sigma_i^2 \exp(-\beta_i |s - t|)$, for $s, t \in [a,b]$, where $\beta_i, \sigma_i > 0$ for $i = 0, 1$.
Assume that $\sigma_1^2 \beta_1 = \sigma_0^2 \beta_0$. Then these processes satisfy the assumptions in Theorem 2(a).
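
The claim in Remark 1 can be checked numerically. The sketch below assumes the factorization $u(t) = \sigma^2 e^{\beta t}$, $v(t) = e^{-\beta t}$ (one natural choice, not stated in the text) and verifies that it reproduces the Ornstein-Uhlenbeck covariance and that $u v' - u' v$ is the constant $-2\sigma^2\beta$, so that the matching condition of Theorem 2(a) reduces to equality of $\sigma^2\beta$ in both populations. All function names are hypothetical.

```python
import math


def ou_cov(s, t, sigma2, beta):
    """Ornstein-Uhlenbeck covariance sigma^2 * exp(-beta * |s - t|)."""
    return sigma2 * math.exp(-beta * abs(s - t))


def u(t, sigma2, beta):
    """Assumed factor u(t) = sigma^2 * exp(beta * t)."""
    return sigma2 * math.exp(beta * t)


def v(t, beta):
    """Assumed factor v(t) = exp(-beta * t)."""
    return math.exp(-beta * t)


def triangular_cov(s, t, sigma2, beta):
    """Triangular form u(min(s, t)) * v(max(s, t))."""
    return u(min(s, t), sigma2, beta) * v(max(s, t), beta)


def uv_difference(t, sigma2, beta):
    """u(t) v'(t) - u'(t) v(t), using the closed-form derivatives of u and v."""
    du = sigma2 * beta * math.exp(beta * t)   # u'(t)
    dv = -beta * math.exp(-beta * t)          # v'(t)
    return u(t, sigma2, beta) * dv - du * v(t, beta)
```

For instance, the parameter pairs $(\sigma^2, \beta) = (1, 2)$ and $(4, 0.5)$ share the value $\sigma^2\beta = 2$, so `uv_difference` returns approximately $-4$ for both at any $t$.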

Remark 2 (Application to the Brownian motion). Theorem 2(b) can also be used
to consistently discriminate between a Brownian motion without trend ($m_0 = 0$) and another
one with trend ($m_1 \ne 0$). It will suffice to consider the case where $u(t) = t$ and $v = 1$.
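
To make the Brownian case concrete, the following sketch works out the Bayes rule in one special situation that we choose for illustration: paths on $[0, T]$, a linear trend $m(t) = ct$ in population 0 and no trend in population 1 (the labelling used in Theorem 2(b)). In that case the classical Cameron-Martin formula gives $d\mu_0/d\mu_1(x) = \exp(c\,x(T) - c^2 T/2)$, and Bayes' formula gives $\eta(x) = (1-p)/(p\, d\mu_0/d\mu_1(x) + 1 - p)$ with $p = P\{Y = 0\}$. The function names and the linear form of the trend are assumptions of this example, not part of the paper.

```python
import math


def density_ratio_bm_drift(x_end, c, T):
    """Cameron-Martin ratio d(mu_0)/d(mu_1) for drift c*t versus no drift.

    It depends on the path only through its endpoint x(T).
    """
    return math.exp(c * x_end - 0.5 * c * c * T)


def eta(ratio, p):
    """P(Y = 1 | X = x) given ratio = d(mu_0)/d(mu_1)(x) and p = P(Y = 0)."""
    return (1.0 - p) / (p * ratio + 1.0 - p)


def bayes_rule(path, c, T, p=0.5):
    """Classify a discretized path whose last entry is x(T): 1 if eta > 1/2."""
    return 1 if eta(density_ratio_bm_drift(path[-1], c, T), p) > 0.5 else 0
```

With $p = 1/2$ the rule reduces to comparing the endpoint $x(T)$ with $cT/2$, which shows how continuity of the density ratio in the supremum norm translates into a simple, explicit Bayes rule.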

Remark 3 (On triangular covariance functions). Covariance functions of type
$\Gamma(s, t) = u(\min(s, t))\, v(\max(s, t))$, called triangular, have received considerable attention in
the literature. For example, Sacks and Ylvisaker (1966) use this condition in the study
of optimal designs for regression problems where the errors are generated by a zero mean
process with covariance function $K(s, t)$. It turns out that the Hilbert space with reproducing
kernel $K$ plays an important role in the results and, as these authors point out, the norm of
this space is particularly easy to handle when $K$ is triangular. On the other hand, Varberg
(1964) has given an interesting representation of the processes $X(t)$, $0 \le t \le b$, with zero
mean and triangular covariance function by proving that they can be expressed in the form



where W is the standard Wiener process and R = R{t, u) is a function, of bounded variation 
with respect to u, defined in terms of K. 

Remark 4 (On plug-in functional classifiers). The explicit knowledge of the
conditional expectation (6) in the cases considered in Theorem 2 could be explored from the
statistical point of view, as it suggests using "plug-in" classifiers obtained by replacing
$\eta(x)$ in (1) with suitable parametric or semiparametric estimators.

Remark 5 (On equivalent Gaussian measures and their supports). According
to a well-known result by Feldman and Hajek, for any given pair of Gaussian processes,
there is a dichotomy in such a way that they are either equivalent or mutually singular.
In the first case both measures $\mu_0$ and $\mu_1$ have a common support $S$ so that Theorem 1 is
applicable with $S = S_0 = S_1$. As for the identification of the support, Vakhania (1975) has
proved that if a Gaussian process, with trajectories in a separable Banach space $\mathcal{F}$, is not
degenerate (i.e., the distribution of any non-trivial linear continuous functional is not
degenerate) then the support of such process is the whole space $\mathcal{F}$. Again, expression (6) of
the regression functional $\eta$ suggests the possibility of investigating possible nonparametric
estimators for the Radon-Nikodym derivative $d\mu_0/d\mu_1$ which would in turn provide plug-in
versions of the Bayes rule $g^*(x) = \mathbb{1}_{\{\eta(x) > 1/2\}}$ with no further assumption on the
structure of the involved Gaussian processes, apart from their equivalence.
3. Some numerical comparisons

The aim of this section is to compare (numerically) the performance of several supervised
functional classification procedures already introduced in the literature. The procedures are
the k-NN rule, computed both with respect to the supremum norm $\|\cdot\|_\infty$ and the norm
$\|\cdot\|_2$, and other discrimination rules reviewed in Section 3.1. One of the objectives of this
numerical study is to gain some insight into which classification procedures perform well
no matter the type of functional data under consideration and could thus be considered a
sort of benchmark for the functional discrimination problem. Section 3.2 contains a Monte
Carlo study carried out on two different functional data generating models. In Section 3.3
we consider six functional real data sets taken from the literature.

3.1 Other functional classifiers 

Here we will review other classification techniques that have been used in the literature
in the context of functional data. From now on we denote by $(t_1, \ldots, t_N)$ the nodes where
the functional predictor $X$ has been observed.

Partial Least Squares (PLS) classification 

Let us first describe the procedure in the context of a multivariate predictor $X$. PLS
is actually a dimension reduction technique for regression problems with predictor $X$ and
a response $Y$ (which in the case of classification takes only two values, 0 or 1, depending
on which population the individual comes from). The dimension reduction is carried out
by projecting $X$ onto a lower-dimensional space such that the coordinates of the projected
$X$, the PLS coordinates, are uncorrelated with each other and have maximum covariance with
$Y$. Then, if the aim is classification, Fisher's linear discriminant is applied to the PLS
coordinates of $X$ (see Barker and Rayens 2003, Liu and Rayens 2007). In the case of a
functional predictor $X$ (see Preda et al. 2007), the above described procedure is applied
to the discretized version of $X$, $\mathbf{X} = (X(t_1), X(t_2), \ldots, X(t_N))$. Here we have chosen the
number of PLS directions, among the values 1, ..., 10, by cross-validation.

Reproducing Kernel Hilbert Space (RKHS) classification 

We will also define this technique initially for a multivariate predictor $X$. For simplicity,
we will assume that $X$ takes values in $[0, 1]^d$. Let $\kappa$ be a function defined on
$[0, 1]^d \times [0, 1]^d$. A RKHS with kernel $\kappa$ is the vector space generated by all finite linear
combinations of functions of the form $\kappa_{t^*}(\cdot) = \kappa(t^*, \cdot)$, for any $t^* \in [0, 1]^d$, and
endowed with the inner product given by $\langle \kappa_{t^*}, \kappa_{t^{**}} \rangle_\kappa = \kappa(t^*, t^{**})$. RKHS are
frequently used in the context of Machine Learning (see Evgeniou et al. 2002, Wahba 2002); for
their applications in Statistics the reader is referred to the monograph of Berlinet and
Thomas-Agnan (2004). In this work we use the Gaussian kernel
$\kappa(s, t) = \exp(-\|s - t\|_2^2/\sigma^2)$, where $\sigma > 0$ is a fixed parameter. The classification
problem is solved by plugging a regression estimator of the type
$\eta_n(x) = \sum_{i=1}^{n} c_i\, \kappa(x, X_i)$ into the Bayes classifier. When $X$ is a random function, this
procedure is applied to the discretized $X$. The parameters $c_i$, for $i = 1, \ldots, n$, are chosen
to minimize the risk functional $\sum_{i=1}^{n} (Y_i - \eta_n(X_i))^2 + \lambda \|\eta_n\|_\kappa^2$, where $\lambda > 0$ is a
penalization parameter. In this work the values of the parameters $\lambda$ and $\sigma^2$ have been chosen
by cross-validation via a leave-one-out procedure. According to our results, it seems that the
performance of the RKHS methodology is rather sensitive to changes in these parameters and
even to the starting point of the leave-one-out procedure mentioned.
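
A minimal sketch of this penalized least-squares fit is given below, under the standard representer-theorem reading of the risk functional: with Gram matrix $K_{ij} = \kappa(X_i, X_j)$ the penalty equals $c'Kc$, and $c = (K + \lambda I)^{-1} Y$ is a minimizer. This closed form is our assumption about how the minimization could be carried out (the paper does not describe its numerical procedure), the function names are hypothetical, and the small Gaussian-elimination solver is included only to keep the example self-contained.

```python
import math


def gaussian_kernel(s, t, sigma2):
    """Gaussian kernel exp(-||s - t||_2^2 / sigma^2) on discretized curves."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(s, t)) / sigma2)


def solve(A, b):
    """Solve the linear system A c = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        c[i] = (M[i][n] - sum(M[i][j] * c[j] for j in range(i + 1, n))) / M[i][i]
    return c


def fit_rkhs(sample, labels, lam, sigma2):
    """Coefficients c = (K + lam * I)^{-1} Y of the penalized fit."""
    n = len(sample)
    A = [[gaussian_kernel(sample[i], sample[j], sigma2) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, [float(y) for y in labels])


def rkhs_classify(x, sample, coef, sigma2):
    """Plug the fitted regression estimate into the Bayes rule (threshold 1/2)."""
    eta_n = sum(c * gaussian_kernel(x, xi, sigma2) for c, xi in zip(coef, sample))
    return 1 if eta_n > 0.5 else 0
```

In practice `lam` and `sigma2` would be tuned, for example by the leave-one-out cross-validation mentioned above; that tuning loop is omitted here.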

Classification via depth measures 

The idea is to assign a new observation $x$ to that population, $P_0$ or $P_1$, with respect
to which $x$ is deeper (see Ghosh and Chaudhuri 2005, Cuevas et al. 2007). From the five
functional depth measures considered by Cuevas et al. (2007) we have taken the h-mode
depth and the random projection (RP) depth.

Specifically, the h-mode depth of $x$ with respect to the population given by the random
element $X$ is defined as $f_h(x) = E(K_h(\|x - X\|_2))$, where $K_h(\cdot) = h^{-1} K(\cdot/h)$, $K$ is a
kernel function (here we have taken the Gaussian kernel $K(t) = \sqrt{2/\pi}\, \exp(-t^2/2)$) and $h$
is a smoothing parameter. As the distribution of $X$ is usually unknown, in the simulations we
actually use the empirical version of $f_h$, $\hat{f}_h(x) = \sum_{i=1}^{n} K_h(\|x - X_i\|_2)$. The smoothing
parameter has been chosen as the 20th percentile of the distances between the functions in
the training sample (see Cuevas et al. 2007).
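
The h-mode depth classifier can be sketched as follows for curves on an equispaced grid. Several details are assumptions of this illustration rather than statements of the paper: the $L_2$ distance is approximated by a Riemann sum with grid step `step`, the 20th percentile is taken by a simple order-statistic rule, the bandwidth is computed separately within each training group, the empirical depth is divided by the group size so that groups of different size are comparable, and ties go to population 0. Function names are hypothetical.

```python
import math


def l2_distance(x, z, step):
    """Riemann-sum approximation of the L2 distance on an equispaced grid."""
    return math.sqrt(step * sum((a - b) ** 2 for a, b in zip(x, z)))


def gaussian_kernel(t):
    """Kernel K(t) = sqrt(2 / pi) * exp(-t^2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)


def bandwidth(sample, step, q=0.20):
    """q-quantile of pairwise distances (must be positive to be usable)."""
    n = len(sample)
    d = sorted(l2_distance(sample[i], sample[j], step)
               for i in range(n) for j in range(i + 1, n))
    return d[int(q * (len(d) - 1))]


def h_mode_depth(x, sample, h, step):
    """Empirical h-mode depth of x, averaged over the training curves."""
    return sum(gaussian_kernel(l2_distance(x, xi, step) / h) / h
               for xi in sample) / len(sample)


def depth_classify(x, sample0, sample1, step):
    """Assign x to the population in which it is deeper (ties go to 0)."""
    d0 = h_mode_depth(x, sample0, bandwidth(sample0, step), step)
    d1 = h_mode_depth(x, sample1, bandwidth(sample1, step), step)
    return 0 if d0 >= d1 else 1
```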

To compute the RP depth the training sample $X_1, \ldots, X_n$ is projected onto a (functional)
random direction $a$ (independent of the $X_i$). The sample depth of an observation $x$ with
respect to $P_i$ is defined as the univariate depth of the projection of $x$ onto $a$ with respect
to the projected training sample from $P_i$. Since $a$ is a random element this definition leads
to a random measure of depth, but a single representative value has been obtained by
averaging these random depths over 50 independent random directions (see Cuevas and
Fraiman 2008 for a certain theoretical development of this idea). If we are working with
discretized versions $(x(t_1), \ldots, x(t_N))$ of the functional data $x(t)$, we may take $a$ according
to a uniform distribution on the unit sphere of $\mathbb{R}^N$. This can be achieved, for example, by
setting $a = Z/\|Z\|$, where $Z$ is drawn from the standard Gaussian distribution on $\mathbb{R}^N$.
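
The random projection step can be sketched as below. The text does not specify which univariate depth is applied to the projections, so the halfspace (Tukey) depth used here is purely an assumption for illustration; the function names are also hypothetical.

```python
import math
import random


def random_direction(N, rng):
    """Uniform direction on the unit sphere of R^N, obtained as Z / ||Z||."""
    z = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def project(x, a):
    """Inner product of a discretized curve with the direction a."""
    return sum(xi * ai for xi, ai in zip(x, a))


def univariate_depth(value, points):
    """Halfspace depth on the line: smaller fraction of points on either side."""
    below = sum(1 for p in points if p <= value)
    above = sum(1 for p in points if p >= value)
    return min(below, above) / len(points)


def rp_depth(x, sample, n_directions=50, rng=None):
    """Average univariate depth of x over independent random directions."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_directions):
        a = random_direction(len(x), rng)
        total += univariate_depth(project(x, a), [project(xi, a) for xi in sample])
    return total / n_directions
```

A new curve would then be assigned to the population for which `rp_depth` is larger, exactly as with the h-mode depth.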

Moving window rule 

The moving window classifier is given by

$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \mathbb{1}_{\{Y_i = 0,\, X_i \in B(x,h)\}} \ge \sum_{i=1}^{n} \mathbb{1}_{\{Y_i = 1,\, X_i \in B(x,h)\}}, \\ 1 & \text{otherwise,} \end{cases}$$

where $h = h_n > 0$ is a smoothing parameter. This classification rule was considered in the
functional setting, for instance, by Abraham et al. (2006). In this work the parameter $h$ has
again been chosen via cross-validation.
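
The moving window rule is a majority vote among the training curves lying in a ball around $x$. A minimal sketch follows; the supremum distance and the use of a closed ball are assumptions of the example, and the function names are hypothetical.

```python
def sup_distance(x, z):
    """Supremum-norm distance between two curves sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(x, z))


def moving_window_classify(x, sample, labels, h, dist=sup_distance):
    """Majority vote among training curves within distance h of x.

    Returns 0 when population 0 has at least as many votes (including the
    case of an empty ball), and 1 otherwise.
    """
    votes0 = sum(1 for xi, yi in zip(sample, labels) if yi == 0 and dist(x, xi) <= h)
    votes1 = sum(1 for xi, yi in zip(sample, labels) if yi == 1 and dist(x, xi) <= h)
    return 0 if votes0 >= votes1 else 1
```

Unlike k-NN, the number of voting neighbours here is random and may be zero, which is why the choice of $h$ matters so much in practice.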
3.2 Monte Carlo results

In this section we study two functional data models already considered by other authors.
More specifically, in Model 1, similar to one used in Cuevas et al. (2007), $X|Y = i$ is a
Gaussian process with mean $m_i(t) = 30\,(1 - t)^{1.1^{i}}\, t^{1.1^{1-i}}$ and covariance function
$\Gamma_i(s, t) = 0.25 \exp(-|s - t|/0.3)$, for $i = 0, 1$. Observe that this model with smooth
trajectories satisfies the assumptions in Theorem 2 and thus we would expect the k-NN
classification rule (with respect to the $\|\cdot\|_\infty$ norm) to perform nicely. Let us note that the
value of 1.1 in the exponent of $m_i(t)$ is in fact the one used in Model 1, p. 487, of Cuevas
et al. (2007), although in their work a 1.2 was misprinted instead.
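
Trajectories of Model 1 can be simulated exactly on an equispaced grid because a stationary Gaussian process with covariance $0.25 \exp(-|s - t|/0.3)$ is Markov: on a grid of step $\Delta$ it is an AR(1) sequence with coefficient $\rho = \exp(-\Delta/0.3)$ and innovation variance $0.25(1 - \rho^2)$. The sketch below uses this recursion. The mean function follows our reading of the exponents as $1.1^{i}$ and $1.1^{1-i}$, which should be treated as an assumption, as should the function names.

```python
import math
import random


def mean_model1(t, i):
    """Assumed mean of population i: 30 (1 - t)^(1.1^i) t^(1.1^(1 - i))."""
    return 30.0 * (1.0 - t) ** (1.1 ** i) * t ** (1.1 ** (1 - i))


def simulate_model1(i, n_nodes=51, rng=None):
    """One discretized trajectory of Model 1 on an equispaced grid of [0, 1]."""
    rng = rng or random.Random()
    step = 1.0 / (n_nodes - 1)
    rho = math.exp(-step / 0.3)                    # AR(1) coefficient
    innov_sd = math.sqrt(0.25 * (1.0 - rho ** 2))  # innovation std. deviation
    noise = rng.gauss(0.0, 0.5)                    # stationary start, variance 0.25
    path = []
    for j in range(n_nodes):
        if j > 0:
            noise = rho * noise + innov_sd * rng.gauss(0.0, 1.0)
        path.append(mean_model1(j * step, i) + noise)
    return path
```

Calling `simulate_model1(0)` and `simulate_model1(1)` repeatedly yields training and test samples of the kind described in this section.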

Model 2 appears in Preda et al. (2007), but here the functions $h_i$, used to define the 
mean, have been rescaled to have domain [0, 1]. The trajectories of $X|Y = i$ are given by 

$$X_i(t) = U\, h_1(t) + (1-U)\, h_{i+2}(t) + \epsilon(t) \quad \text{for } i = 0, 1,$$ (10) 

where $U$ is uniformly distributed on [0, 1], $h_1(t) = 2\max(3 - 5|2t - 1|, 0)$, $h_2(t) = h_1(t - 1/5)$, 
$h_3(t) = h_1(t + 1/5)$ and $\epsilon(t)$ is an approximation to continuous-time white noise. In 
practice, this means that in the discretized approximations $(X(t_1), \ldots, X(t_N))$ to $X(t)$, the 
variables $\epsilon(t_1), \ldots, \epsilon(t_N)$ are independently drawn from a standard normal distribution. 
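Equation (10) translates almost directly into code. The following is a minimal sketch with hypothetical function names; it assumes the reading $h_2(t) = h_1(t - 1/5)$, $h_3(t) = h_1(t + 1/5)$ and independent standard normal noise at each node, as stated above.

```python
import random

def h1(t):
    """Triangular bump of height 6 centred at t = 1/2."""
    return 2.0 * max(3.0 - 5.0 * abs(2.0 * t - 1.0), 0.0)

def h2(t):
    return h1(t - 0.2)      # h1 shifted to the right by 1/5

def h3(t):
    return h1(t + 0.2)      # h1 shifted to the left by 1/5

def model2_curve(label, grid, rng):
    """One discretized trajectory of X | Y = label under Model 2."""
    u = rng.uniform(0.0, 1.0)                  # one U per curve
    h_other = h2 if label == 0 else h3         # h_{i+2} for i = 0, 1
    return [u * h1(t) + (1.0 - u) * h_other(t) + rng.gauss(0.0, 1.0)
            for t in grid]

rng = random.Random(0)
grid = [j / 50 for j in range(51)]             # 51 equispaced nodes on [0, 1]
curve0 = model2_curve(0, grid, rng)
curve1 = model2_curve(1, grid, rng)
```

Since the noise is independent across nodes, the simulated curves are very irregular, which is relevant to the later remark on smoothing.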

The simulation results are summarized in Tables 1 and 2. The functional data have been 
evaluated at 51 equispaced nodes in both models. The 
number of Monte Carlo runs is 100. In every run we generated two training samples (from 
$X|Y = 0$ and $X|Y = 1$ respectively), each of size 100, and we also generated a 
test sample of size 50 from each of the two populations. The tables display descriptive 
statistics of the proportion of correctly classified observations in these test samples. 
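The Monte Carlo protocol and the descriptive statistics reported in the tables can be sketched as follows. This is an illustrative outline only: `toy_curve` is a hypothetical stand-in for the two models, k is fixed rather than chosen by cross-validation, the quartile convention and the sample standard deviation are choices made here, and the sizes in the final call are deliberately smaller than those of the study so that the example runs quickly.

```python
import random
import statistics

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def knn_predict(x, X_train, y_train, k):
    """Majority vote among the k training curves closest to x in sup-norm."""
    order = sorted(range(len(X_train)), key=lambda i: sup_dist(x, X_train[i]))
    ones = sum(y_train[i] for i in order[:k])
    return 1 if 2 * ones > k else 0

def toy_curve(label, rng, n_nodes=11):
    # Hypothetical data generator: a level shift between classes plus noise.
    return [label + rng.gauss(0.0, 0.5) for _ in range(n_nodes)]

def monte_carlo(n_runs, n_train, n_test, k, rng):
    """Proportion of correctly classified test curves in each run."""
    rates = []
    for _ in range(n_runs):
        y_train = [0] * n_train + [1] * n_train
        X_train = [toy_curve(y, rng) for y in y_train]
        y_test = [0] * n_test + [1] * n_test
        hits = sum(knn_predict(toy_curve(y, rng), X_train, y_train, k) == y
                   for y in y_test)
        rates.append(hits / len(y_test))
    return rates

def summarize(rates):
    q1, med, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return {"Minimum": min(rates), "First quartile": q1, "Median": med,
            "Mean": statistics.mean(rates), "Third quartile": q3,
            "Maximum": max(rates), "Std. deviation": statistics.stdev(rates)}

rates = monte_carlo(n_runs=20, n_train=20, n_test=10, k=3, rng=random.Random(0))
summary = summarize(rates)
```

With the design described in the text one would call `monte_carlo(n_runs=100, n_train=100, n_test=50, ...)` on curves evaluated at 51 nodes.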





                 k-NN L∞   k-NN L2   PLS      RKHS     h-modal  RP(hM)   MWR
Minimum          0.6200    0.6600    0.6000   0.4800   0.6400   0.5400   0.6600
First quartile   0.8000    0.8000    0.8000   0.6600   0.8000   0.7800   0.8000
Median           0.8400    0.8400    0.8400   0.8400   0.8400   0.8400   0.8400
Mean             0.8396    0.8354    0.8371   0.7999   0.8409   0.8260   0.8393
Third quartile   0.8800    0.8800    0.8800   0.9400   0.8800   0.8800   0.8800
Maximum          0.9800    0.9600    0.9800   1.0000   0.9800   0.9800   1.0000
Std. deviation   0.0603    0.0572    0.0668   0.1457   0.0589   0.0725   0.0634

Table 1: Simulation results for Model 1 

Regarding Model 1, observe that there is little difference between the correct classification 
rates of the methods, except for the RKHS procedure, which performs worse. In 
Model 2 the PLS, RKHS and h-modal methods slightly outperform the others. When the 
Monte Carlo study with this model was carried out, we also applied the k-NN classification 
procedures to a spline-smoothed version of the X trajectories. The result was that the mean 
correct classification rate increased to 0.9582 in the case of the supremum norm and to 
0.9624 in the case of the $L^2$ norm. This, together with the analysis of the flies data in the 
next subsection, seems to suggest that, when the curves X are irregular, smoothing these 
functions will enhance the k-NN discrimination procedure. 

                 k-NN L∞   k-NN L2   PLS      RKHS     h-modal  RP(hM)   MWR
Minimum          0.8400    0.8400    0.8800   0.8400   0.8600   0.8400   0.8200
First quartile   0.9200    0.9400    0.9600   0.9600   0.9400   0.9400   0.9400
Median           0.9600    0.9600    0.9800   0.9800   0.9800   0.9600   0.9600
Mean             0.9522    0.9558    0.9686   0.9688   0.9657   0.9522   0.9570
Third quartile   0.9800    0.9800    0.9800   1.0000   1.0000   0.9800   0.9800
Maximum          1.0000    1.0000    1.0000   1.0000   1.0000   1.0000   1.0000
Std. deviation   0.0335    0.0355    0.0279   0.0313   0.0308   0.0345   0.0349

Table 2: Simulation results for Model 2 

3.3. Some comparisons based on real data sets 

3.3.1. Brief description of the data sets 

Berkeley Growth Data: The Berkeley Growth Study (Tuddenham and Snyder 1954) recorded 
the heights of $n_0 = 54$ girls and $n_1 = 39$ boys between the ages of 1 and 18 years. Heights 
were measured at 31 ages for each child. These data have been previously analyzed by 
Ramsay and Silverman (2002). 

ECG data: These are electrocardiogram (ECG) data, studied by Wei and Keogh (2006), 
from the MIT-BIH Arrhythmia database (see Goldberger et al. 2000). Each observation 
contains the successive measurements recorded by one electrode during one heartbeat and 
was normalized and rescaled to have length 85. A group of cardiologists have assigned a 
label of normal or abnormal to each data record. Due to computational limitations, of the 
original 2026 records in the data set, we have randomly chosen only 200 observations from 
each group. 

MCO data: The variable under study is the mitochondrial calcium overload (MCO), measured 
every 10 seconds during an hour in isolated mouse cardiac cells. The data come from 
research conducted by Dr. David García-Dorado at the Vall d'Hebron Hospital (see 
Ruiz-Meana et al. 2003, Cuevas, Febrero and Fraiman 2004, 2007). In order to assess whether a certain 
drug increased the MCO level, a sample of functions of size $n_0 = 45$ was taken from a control 
group and $n_1 = 44$ functions were sampled from the treatment group. 
Spectrometric data: For each of 215 pieces of meat a spectrometer provided the absorbance 
attained at 100 different wavelengths (see Ferraty and Vieu 2006 and references therein). 
The fat content of the meat was also obtained via chemical processing and each of the meat 
pieces was classified as low- or high-fat. 

Phoneme data: The X variable is the log-periodogram (discretized to 150 nodes) of a 
phoneme. The two populations correspond to phonemes "aa" and "ao" respectively (see 
more information in Ferraty and Vieu 2006). We have considered a sample of 100 observa- 
tions from each phoneme. 

Medflies data: This dataset was obtained by Prof. Carey from U.C. Davis (see Carey et al. 
1998) and has been studied, for instance, by Müller and Stadtmüller (2005). The predictor 
X is the number of eggs laid daily by a Mediterranean fruit fly for a 30-day period. The 
fly is classified as long-lived if its remaining lifetime past 30 days is more than 14 days and 
short-lived otherwise. The numbers of long- and short-lived flies observed were 256 and 278 
respectively. 

3.3.2. Results 

We have applied the classification techniques reviewed in Section 3.1 to the real data 
sets just described. While carrying out the simulations of Subsection 3.2, we observed that 
the performance of the RKHS procedure was very dependent on the initial values of the 
parameters $\sigma_K$ and $\lambda$ provided for the cross-validation algorithm. In fact, finding initial 
values for these parameters that would finally yield competitive results with respect to the 
other methods took a considerable time. Thus we decided to exclude the RKHS classification 
method from the study with real data. 

We have computed, via a cross-validation procedure, the mean correct classification rates 
attained by the different discrimination methods on the real data sets. In Table 3 we display 
the results. Since the egg-laying trajectories in the medflies data set were very irregular 
and spiky, we have computed the correct classification rate for both the original data and 
a smoothed version obtained with splines. The smoothing leads to a better performance of 
the k-NN procedure with the supremum metric, just as happened in the simulations with 
Model 2. 
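The text does not state which cross-validation scheme was used to estimate the correct classification rates. Purely as an illustration, the sketch below assumes a leave-one-out scheme and a hypothetical 1-NN classifier with the supremum distance; any of the classifiers discussed above could be passed in its place.

```python
def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def one_nn(x, X_train, y_train):
    """Label of the training curve closest to x in sup-norm."""
    j = min(range(len(X_train)), key=lambda i: sup_dist(x, X_train[i]))
    return y_train[j]

def loo_correct_rate(X, y, classify):
    """Leave-one-out estimate of the correct classification rate."""
    hits = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]      # training sample without curve i
        y_rest = y[:i] + y[i + 1:]
        if classify(X[i], X_rest, y_rest) == y[i]:
            hits += 1
    return hits / len(X)

# Tiny hypothetical data set with two well separated groups.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 1.1], [1.1, 0.9], [0.9, 1.0]]
y = [0, 0, 0, 1, 1, 1]
rate = loo_correct_rate(X, y, one_nn)
```

When the classifier has a tuning parameter (k or h), its selection would have to be repeated inside each leave-one-out step to avoid an optimistic estimate.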

As a conclusion we would say that the k-NN classification methodology with respect to 
the $L^\infty$ norm is always among the best performing ones if the X trajectories are smooth. 
The k-NN procedure with respect to the $L^2$ norm and the PLS methodology also give good 
results, although the latter has the drawback of a much higher computation time. 

Data set                  k-NN L∞   k-NN L2   PLS      h-modal  RP(hM)   MWR
Growth                    0.9462    0.9677    0.9462   0.9462   0.9462   0.9570
ECG                       0.9900    0.9950    0.9825   0.9900   0.8575   0.8850
MCO                       0.8427    0.8315    0.8876   0.7640   0.7079   0.6854
Spectrometric             0.9070    0.8558    0.9163   0.6791   0.6930   0.6558
Phoneme                   0.7300    0.7800    0.7400   0.7300   0.7450   0.6950
Medflies (non-smoothed)   0.5468    0.5412    0.5262   0.4925   0.5056   0.5431
Medflies (smoothed)       0.5712    0.5431    0.5094   0.5075   0.5543   0.5206

Table 3: Mean correct classification rates for the real data sets 

References 